16-726 Learning-Based Image Synthesis (Spring 2025)

Assignment #5 - Cats Photo Editing

By: Lamia Alsalloom

Part 1: Inverting the Generator


In this section, we invert the generator by solving a nonconvex optimization problem in the latent space of a pretrained StyleGAN model using the w+ representation. The results below are obtained by applying different combinations of loss functions during inversion: an Lp (L1) loss that enforces pixel-level similarity, a perceptual loss that preserves high-level features from a pretrained network, and an L2 regularization term on the latent update (delta) that constrains the optimization. The outputs show how each combination affects reconstruction quality, highlighting the improvements in image detail and background fidelity obtained with the perceptual loss and regularization.
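
For reference, a minimal sketch of the inversion objective we optimize (not the exact assignment code; the generator, perceptual loss module, optimizer choice, and default weights shown here are illustrative assumptions):

    import torch

    def invert(generator, perc_loss, target, w_init, n_iters=1000,
               l1_w=10.0, perc_w=0.01, reg_w=0.001):
        # Optimize a perturbation (delta) of the initial w+ code so that the
        # synthesized image matches the target under the combined losses.
        delta = torch.zeros_like(w_init, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=0.01)
        for _ in range(n_iters):
            opt.zero_grad()
            img = generator(w_init + delta)              # synthesize from w+ = w_init + delta
            loss = (l1_w * (img - target).abs().mean()   # Lp (L1) pixel loss
                    + perc_w * perc_loss(img, target)    # perceptual loss
                    + reg_w * delta.pow(2).mean())       # L2 regularization on delta
            loss.backward()
            opt.step()
        return (w_init + delta).detach()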

1. Various combinations of the losses, including Lp loss, perceptual loss, and/or a regularization loss that penalizes the L2 norm of delta:

Original

Original Image

conv_1_noise1

L1 pixel loss weight: 0

perc loss weight: 0.01

Regularization Loss Weight: 0.001

conv_1_noise2

L1 pixel loss weight: 0

perc loss weight: 0

Regularization Loss Weight: 0.001

conv_2_noise1

L1 pixel loss weight: 1

perc loss weight: 1

Regularization Loss Weight: 0.1

conv_3_noise1

L1 pixel loss weight: 10

perc loss weight: 1

Regularization Loss Weight: 0.1

conv_3_noise2

L1 pixel loss weight: 10

perc loss weight: 0

Regularization Loss Weight: 0

conv_4_noise1

L1 pixel loss weight: 0

perc loss weight: 0

Regularization Loss Weight: 0.1

conv_4_noise2

L1 pixel loss weight: 0.1

perc loss weight: 0.01

Regularization Loss Weight: 0.001





2. Different generative models, including a vanilla GAN and StyleGAN:

The following results are obtained by optimizing a combination of L1 and perceptual losses over 1000 iterations in the z latent space. All experiments are run on an A100 GPU. Our experiments demonstrate that although the vanilla GAN is computationally efficient (~40 s) due to its streamlined architecture, its reconstructions are significantly less detailed and realistic than the higher-fidelity outputs generated by StyleGAN (~47 s).

Original

Original Image

Vanilla GAN

Vanilla GAN

StyleGAN

StyleGAN



3. Different latent spaces (latent code in z space, w space, and w+ space):

The following results are obtained by using different latent spaces with the StyleGAN model, while optimizing a combination of L1 and perceptual losses over 1000 iterations. All experiments are run on an A100 GPU, each taking roughly 47 s. A minimal sketch of how each latent code is formed follows the table below.

Latent Space
Example 1
Example 2
original image
Original Example 1
Original Example 2
z space
z space Example 1
z space Example 2
w space
w space Example 1
w space Example 2
w+ space
w+ space Example 1
w+ space Example 2
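
For concreteness, a minimal sketch of how each latent code is formed (the mapping network below is a small stand-in and num_ws is an assumed value; this is not the actual StyleGAN code):

    import torch
    import torch.nn as nn

    num_ws = 14                                      # number of per-layer style inputs (resolution dependent; assumed)
    mapping = nn.Sequential(                         # stand-in for StyleGAN's mapping MLP
        nn.Linear(512, 512), nn.LeakyReLU(0.2),
        nn.Linear(512, 512))

    z = torch.randn(1, 512)                          # z space: sampled from a standard normal prior
    w = mapping(z)                                   # w space: z pushed through the mapping network
    w_plus = w.unsqueeze(1).repeat(1, num_ws, 1)     # w+ space: an independent w per synthesis layer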

4. Comments on why the various outputs look the way they do, which combination gives the best result, and how fast the method performs:

- When using the z latent space for StyleGAN, the optimization becomes particularly challenging because gradients must pass through the additional mapping network, often leading to reconstructions that remain very close to the initial state rather than converging to the input.

- While both the w and w+ latent spaces improve reconstruction quality, the w+ space delivers greater detail and background fidelity owing to its increased expressiveness and flexibility.

- Experiments show that vanilla GAN inversions take about 40 seconds per image, while StyleGAN-based inversions require roughly 47 seconds per image.

- The results indicate that GAN outputs tend to be unstable and that using perceptual loss is essential for generating images that closely match the reference image, whereas regularizing the latent update (delta) has little effect.

- The best performance in our experiments is achieved using StyleGAN with either the w or w+ latent spaces, as these methods capture the input image more accurately than using the z space, and StyleGAN outperforms vanilla GAN for inversion tasks.

- The best result is obtained using StyleGAN with the w+ space, an Lp loss weight of 10, a perceptual loss weight of 0.01, and a regularization loss weight of 0.0. The method completes 1000 iterations in about 47 s.

Part 2: Scribble to Image


Sketch Image
Mask Image
Image
Sketch 1
Mask 1
Image 1
Sketch 2
Mask 2
Image 2
Sketch 3
Mask 3
Image 3
Sketch 4
Mask 4
Image 4
Sketch 5
Mask 5
Image 5
Sketch 6
Mask 6
Image 6
Sketch 7
Mask 7
Image 7
Sketch 8
Mask 8
Image 8
Sketch 9
Mask 9
Image 9
Sketch 10

I drew this cat

Mask 10
Image 10
Sketch 11

I drew this cat

Mask 11
Image 11
Using StyleGAN with the w+ latent space and optimizing with L1 and perceptual losses over 1000 iterations, most outputs accurately reflect the mask and capture essential details. However, in some cases the generated image remains too similar to the original training sample or introduces unrealistic features to satisfy the sketch constraints (7th and 9th rows). The color of the sketch itself also plays a critical role, as atypical hues or backgrounds can lead to less convincing results (last two rows).
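
For reference, a minimal sketch of the masked objective used for scribble-to-image, assuming the sketch, binary mask, and generated image share the same resolution and a perceptual loss callable is available (names are illustrative, not the exact assignment code):

    import torch.nn.functional as F

    def scribble_loss(generated, sketch, mask, perc_loss, l1_w=10.0, perc_w=0.01):
        # Only the scribbled pixels (mask == 1) constrain the optimization;
        # everywhere else the image is left to the generator's prior.
        masked_gen = generated * mask
        masked_sketch = sketch * mask
        loss = l1_w * F.l1_loss(masked_gen, masked_sketch)
        loss = loss + perc_w * perc_loss(masked_gen, masked_sketch)
        return loss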

Part 3: Stable Diffusion


Show some example outputs of your guided image synthesis on at least 2 different input images:

Input Image
Prompt
Guidance Strength
Steps
Output
Sketch 1

Grumpy cat reimagined as a royal painting

15
700
Image 1
Sketch 2

Grumpy cat reimagined as a royal painting

15
700
Image 2


Show some example outputs of your guided image synthesis with 2 different amounts of noise added to the input:

Input Image
Prompt
Guidance Strength
Steps
Output
Sketch 1

A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style

15
500
Image 1
Sketch 1

A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style

15
700
Image 2
Sketch 1

A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style

15
1000
Image 3
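
The number of steps controls how much noise is injected into the input before denoising. Below is a minimal sketch of the forward-noising step, assuming a precomputed alphas_cumprod schedule (names are illustrative, not the exact assignment code):

    import torch

    def noise_input(x0, t, alphas_cumprod):
        # DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
        # A larger t (more noising steps) erases more of the input structure.
        a_bar = alphas_cumprod[t]
        eps = torch.randn_like(x0)
        return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps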


Show some example outputs of your guided image synthesis on 2 different classifier-free guidance strength values:
For this part, we fix steps to 700 and use the same prompt on multiple sketch images with varying values of guidance strength.
Prompt: A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style.

Input Image
Strength 3.5
Strength 6.0
Strength 15
Sketch 1
Strength 3.5 Image 1
Strength 6.0 Image 1
Strength 15 Image 1
Sketch 2
Strength 3.5 Image 2
Strength 6.0 Image 2
Strength 15 Image 2
Sketch 3
Strength 3.5 Image 3
Strength 6.0 Image 3
Strength 15 Image 3
Sketch 4
Strength 3.5 Image 4
Strength 6.0 Image 4
Strength 15 Image 4
Sketch 5
Strength 3.5 Image 5
Strength 6.0 Image 5
Strength 15 Image 5
Sketch 6
Strength 3.5 Image 6
Strength 6.0 Image 6
Strength 15 Image 6
Sketch 7
Strength 3.5 Image 7
Strength 6.0 Image 7
Strength 15 Image 7
Sketch 8
Strength 3.5 Image 8
Strength 6.0 Image 8
Strength 15 Image 8
Sketch 9
Strength 3.5 Image 9
Strength 6.0 Image 9
Strength 15 Image 9
Sketch 10
Strength 3.5 Image 10
Strength 6.0 Image 10
Strength 15 Image 10
Sketch 11
Strength 3.5 Image 11
Strength 6.0 Image 11
Strength 15 Image 11
Sketch 12
Strength 3.5 Image 12
Strength 6.0 Image 12
Strength 15 Image 12
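
For reference, a minimal sketch of how the classifier-free guidance strength combines the conditional and unconditional noise predictions (assuming a diffusers-style UNet; the call signature is an assumption, not the exact assignment code):

    import torch

    def cfg_noise_pred(unet, x_t, t, text_emb, uncond_emb, guidance_scale):
        # Classifier-free guidance: extrapolate from the unconditional prediction
        # toward the text-conditioned prediction by `guidance_scale`.
        eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_emb).sample
        eps_text = unet(x_t, t, encoder_hidden_states=text_emb).sample
        return eps_uncond + guidance_scale * (eps_text - eps_uncond)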

Bells & Whistles (Extra Points)


Interpolate between two latent codes in the GAN model, and generate an image sequence (2pt):

Image 1
Image 2
Image 1
Image 2
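
A minimal interpolation sketch, assuming two already-inverted latent codes w1 and w2 and a generator G (names are illustrative):

    import torch

    def interpolate_latents(G, w1, w2, n_frames=10):
        # Linearly interpolate between two latent codes and decode each step
        # to produce the image sequence.
        frames = []
        for alpha in torch.linspace(0.0, 1.0, n_frames):
            w = (1.0 - alpha) * w1 + alpha * w2
            frames.append(G(w))
        return frames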

Implement additional types of constraints. (3pts each): e.g., sketch/shape constraint and warping constraints mentioned in the iGAN paper, or texture constraint using a style loss.:
(1) Sketch/shape constraint: We implemented an extra edge constraint based on a Sobel operator. This sketch/shape constraint computes edge maps from both the generated image and the reference sketch, then applies an L1 loss between the two edge maps. It encourages the generated image to retain the structural details, such as outlines and edges, that are present in the input sketch.


0_data

0_data.png

0_mask

0_mask.png

0_250

0_250.png

0_500

0_500.png

0_750

0_750.png

0_1000

0_1000.png

The additional edge constraint nudges the generator to align with the edges of the scribble, preserving crucial contours and outlines in the final image so that it better reflects the structure of the input sketch.

We place a call in the Criterion and set the weight for the edge-constraint loss to 1:

    import torch
    import torch.nn.functional as F

    def compute_sobel_edges(img):
        """
        Compute an approximate edge map using the Sobel operator.
        img: Tensor of shape (B, C, H, W) with values in [0, 1].
        Returns a tensor of shape (B, 1, H, W) of edge magnitudes.
        """
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]], device=img.device).view(1, 1, 3, 3)
        sobel_y = torch.tensor([[-1., -2., -1.],
                                [ 0.,  0.,  0.],
                                [ 1.,  2.,  1.]], device=img.device).view(1, 1, 3, 3)
        # Average the channels to get a single-channel grayscale image.
        gray = img.mean(dim=1, keepdim=True)
        edge_x = F.conv2d(gray, sobel_x, padding=1)
        edge_y = F.conv2d(gray, sobel_y, padding=1)
        edges = torch.sqrt(edge_x ** 2 + edge_y ** 2)
        return edges

    def edge_loss(generated, sketch):
        """
        Compute an L1 loss between the edge maps of the generated image and the input sketch.
        generated, sketch: Tensors of shape (B, C, H, W) with values in [0, 1].
        """
        gen_edges = compute_sobel_edges(generated)
        sketch_edges = compute_sobel_edges(sketch)
        return F.l1_loss(gen_edges, sketch_edges)
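
A hypothetical wiring inside the criterion (the surrounding term names below are illustrative, not the exact assignment code; the edge weight of 1 follows the text above):

    # Illustrative: combine with the existing reconstruction terms.
    total_loss = l1_term + perc_term + 1.0 * edge_loss(generated, sketch)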
      


(2) Texture loss: The texture loss is a style loss computed via Gram matrices of features extracted from a pretrained network, used to capture and enforce similar texture patterns between the generated image and a reference texture image. It encourages the generated image to adopt the local texture statistics, such as color distributions and patterns, of the provided texture reference, improving the overall style consistency of the output.


4_data

4_data.png

4_mask

4_mask.png

4_250

4_250.png

4_500

4_500.png

4_1000

4_1000.png

Reference texture image

By applying a high texture weight (1.0) alongside minimal edge and shape weights, the generated images strongly incorporate patterns from the zebra-like reference texture, causing the cat’s fur and background to adopt striping and bold textural elements. With lower edge/shape constraints, the sketch outlines have less influence on the structure, allowing the texture to dominate the final style.

We place a call in the Criterion and set the weight for the texture loss to 1.0:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vgg19

    class TextureLoss(nn.Module):
        def __init__(self, layers=(3, 8, 17, 26)):  # example VGG19 layer indices
            super(TextureLoss, self).__init__()
            # Use a frozen, pretrained VGG19 feature extractor (registered as a
            # submodule, so moving the criterion with .to(device) also moves it).
            self.vgg = vgg19(pretrained=True).features.eval()
            for param in self.vgg.parameters():
                param.requires_grad = False
            self.layers = set(layers)

        def gram_matrix(self, features):
            # The Gram matrix captures channel-wise feature correlations (texture statistics).
            B, C, H, W = features.size()
            features = features.view(B, C, H * W)
            gram = torch.bmm(features, features.transpose(1, 2))
            return gram / (C * H * W)

        def forward(self, generated, texture):
            loss = 0.0
            x, y = generated, texture
            for i, layer in enumerate(self.vgg):
                x = layer(x)
                y = layer(y)
                if i in self.layers:
                    # Match the Gram matrices of the generated image and the texture reference.
                    loss = loss + F.mse_loss(self.gram_matrix(x), self.gram_matrix(y))
            return loss
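
A hypothetical usage, assuming a reference texture tensor texture_img of shape (B, 3, H, W) (names are illustrative, not the exact assignment code):

    # Illustrative: add the style term to the criterion with weight 1.0.
    texture_criterion = TextureLoss()
    total_loss = total_loss + 1.0 * texture_criterion(generated, texture_img)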